Load data.
df <- read_csv("e2-anonymous.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## workerId = col_character(),
## condition = col_character(),
## strategy = col_character(),
## gender = col_character(),
## age = col_character(),
## education = col_character(),
## chart_use = col_character(),
## interactions = col_character(),
## trial = col_character(),
## userInput = col_character()
## )
## See spec(...) for full column specifications.
head(df)
## # A tibble: 6 x 44
## workerId batch bonus condition duration n_data_conds n_trials strategy
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 7819bfb6 69 2.5 text 752. 18 19 I looke…
## 2 7819bfb6 69 2.5 text 752. 18 19 I looke…
## 3 7819bfb6 69 2.5 text 752. 18 19 I looke…
## 4 7819bfb6 69 2.5 text 752. 18 19 I looke…
## 5 7819bfb6 69 2.5 text 752. 18 19 I looke…
## 6 7819bfb6 69 2.5 text 752. 18 19 I looke…
## # … with 36 more variables: confidence <dbl>, gender <chr>, age <chr>,
## # education <chr>, chart_use <chr>, abs_err <dbl>, causal_support <dbl>,
## # causal_support_d <dbl>, causal_support_t <dbl>, count_GT <dbl>,
## # count_GnT <dbl>, count_nGT <dbl>, count_nGnT <dbl>, ground_truth <dbl>,
## # interactions <chr>, llA <dbl>, llB <dbl>, llC <dbl>, llD <dbl>, n <dbl>,
## # p_gene_d <dbl>, p_gene_t <dbl>, payoff <dbl>, q_idx <dbl>,
## # response_A <dbl>, response_B <dbl>, response_C <dbl>, response_D <dbl>,
## # total_GT <dbl>, total_GnT <dbl>, total_nGT <dbl>, total_nGnT <dbl>,
## # trial <chr>, trial_dur <dbl>, trial_idx <dbl>, userInput <chr>
Drop practice trial, bad batches of filtbars data, and one participant who is missing an attention check trial for some reason.
df = df %>% filter(trial != "practice")
Let’s exclude workers who miss either the trial where causal support is the largest or the trial where causal support is smallest. We define miss as responding less than 25% (i.e., less than the normative prior) when the normative response is about 100% (i.e., when signal is maxed out).
exclude_df <- df %>%
group_by(workerId) %>%
summarise(
max_trial_idx = which(trial_idx == -1)[1],
max_trial_response_D = response_D[[max_trial_idx]],
exclude = max_trial_response_D < 20
)
## `summarise()` ungrouping output (override with `.groups` argument)
head(exclude_df)
## # A tibble: 6 x 4
## workerId max_trial_idx max_trial_response_D exclude
## <chr> <int> <dbl> <lgl>
## 1 005a1c7a 10 30 FALSE
## 2 0097df14 10 100 FALSE
## 3 00d941e1 10 23.3 FALSE
## 4 02f5c174 10 20 FALSE
## 5 04579941 10 17.2 TRUE
## 6 04f4574f 10 15 TRUE
What proportion of workers are missing this attention check?
sum(exclude_df$exclude) / length(exclude_df$exclude)
## [1] 0.2617354
Apply the exclusion criteria.
df = exclude_df %>%
select(workerId, exclude) %>%
full_join(df, by = "workerId") %>%
filter(!exclude) %>%
select(-exclude)
How many participants per condition after exclusions? (target sample size was 80 per condition)
df %>%
group_by(condition) %>%
summarise(
n = length(unique(workerId))
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## condition n
## <chr> <int>
## 1 aggbars 104
## 2 bars 102
## 3 filtbars 111
## 4 icons 101
## 5 text 101
Let’s start by looking at the over pattern of absolute error. Then we’ll look at the pattern of responses vs ground truth.
Here’s the absolute error per trial, separated by sample size and visualization condition.
df %>%
ggplot(aes(x = condition, y = abs_err)) +
stat_eye() +
geom_point(position = position_jitter(), alpha = 0.5) +
theme_bw() +
facet_grid(n ~ .)
Let’s also look at the average absolute error per worker and level of sample size, separated by visualization condition.
df %>%
group_by(workerId, condition, n) %>%
summarise(
avg_abs_err = mean(abs_err)
) %>%
ungroup() %>%
ggplot(aes(x = condition, y = avg_abs_err)) +
stat_eye() +
geom_point(position = position_jitter(), alpha = 0.5) +
theme_bw() +
facet_grid(n ~ .)
## `summarise()` regrouping output by 'workerId', 'condition' (override with `.groups` argument)
Absolute error seems to get higher as sample size increases consistent with experiment 1. It looks like aggbars, bars, and icons have slightly lower average absolute error at larger sample size.
Overall, absolute error is very high. This is probably a realistic reflection of the task difficulty for causal inferences.
Let’s see if we can make better sense of the results by looking at raw responses vs the ground truth. The red line represents perfect performance.
df %>%
ggplot(aes(x = ground_truth, y = response_D)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 100, intercept = 0, color = "red") +
theme_bw() +
facet_grid(n ~ condition)
We expect to see an inverse S-shaped pattern here (i.e., a linear in log odds pattern with a slope less than one, at least above ground_truth = 0.5). We see this pattern here, but the responses are noisy.
We expect a linear model to fit these data well in log odds units, so lets sanity check this intuition by looking at causal support (i.e., logit(ground_truth)) vs the log ratio of responses.
df = df %>%
mutate(
# adjust response units
response_A = if_else(
response_A > 99.5, 99.5,
if_else(
response_A < 0.5, 0.5,
as.numeric(response_A))),
response_B = if_else(
response_B > 99.5, 99.5,
if_else(
response_B < 0.5, 0.5,
as.numeric(response_B))),
response_C = if_else(
response_C > 99.5, 99.5,
if_else(
response_C < 0.5, 0.5,
as.numeric(response_C))),
response_D = if_else(
response_D > 99.5, 99.5,
if_else(
response_D < 0.5, 0.5,
as.numeric(response_D))),
# calculate log response ratio
lrr = log(response_D / 100) - log(response_A / 100 + response_B / 100 + response_C / 100),
lrr_d = log(response_B / 100 + response_D / 100) - log(response_A / 100 + response_C / 100),
lrr_t = log(response_C / 100 + response_D / 100) - log(response_A / 100 + response_B / 100)
)
df %>%
ggplot(aes(x = causal_support, y = lrr)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, color = "red") +
coord_cartesian(
xlim = c(-10, 25),
ylim = c(-10, 25)
) +
theme_bw() +
facet_grid(n ~ condition)
Let’s see that smaller sample size on it’s own axis scale.
df %>%
filter(n == 100) %>%
ggplot(aes(x = causal_support, y = lrr)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, color = "red") +
coord_cartesian(
xlim = c(-10, 10),
ylim = c(-10, 10)
) +
theme_bw() +
facet_grid(n ~ condition)
It seems clear that slopes will be less than 1 in logit-logit coordinate space at the larger sample size and closer to 1 at smaller sample size. This is what we expect to see if people are insensitive to sample size—greater underestimation error at larger sample size.
It’s hard to see other patterns (e.g., differences in slopes between conditions) without a model.
Above we’re getting a pretty good sense of how workers respond to sample size. But what about how they respond to the other signals in the visualization?
Let delta p disease be the difference in the proportion of people with disease given gene vs no gene.
Let delta p treatment be the difference in the proportion of people with disease given treatment and gene vs no gene.
df <- df %>%
mutate(
delta_p_d = (count_nGnT + count_nGT) / (total_nGnT + total_nGT) - (count_GnT + count_GT) / (total_GnT + total_GT),
delta_p_t = count_nGT / total_nGT - count_GT / total_GT
)
Let’s look at log response ratios (lrr, defined above) representing the user’s response to each of these signals.
To the extent that users are sensitive to these signals, lrrs should increase as delta p values go down
df %>%
ggplot(aes(x = delta_p_d, y = lrr_d)) +
geom_point(alpha = 0.4) +
theme_bw() +
facet_grid(n ~ condition)
df %>%
ggplot(aes(x = delta_p_t, y = lrr_t)) +
geom_point(alpha = 0.4) +
theme_bw() +
facet_grid(n ~ condition)
People seem sensitive to these signals in that the overal trend on each scatterplot seems to be a negative slope. Though their responses are not a perfect function of the differences in proportions we show them, as expected based on Griffiths and Tenenbaum (2005).
We might be able to get a better sense of users’ sensitivity by comparing lrrs to normative causal support for the corresponding effects of the gene on disease and treatment, respectively.
df %>%
ggplot(aes(x = causal_support_d, y = lrr_d)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, color = "red") +
coord_cartesian(
xlim = c(-10, 100),
ylim = c(-10, 100)
) +
theme_bw() +
facet_grid(n ~ condition)
df %>%
ggplot(aes(x = causal_support_t, y = lrr_t)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, color = "red") +
coord_cartesian(
xlim = c(-5, 25),
ylim = c(-5, 25)
) +
theme_bw() +
facet_grid(n ~ condition)
The functional relationship between causal support and lrrs for each component of the signal in these charts (i.e., the gene effects on disease and treatment, respectively) looks to be linear in log odds as expected. The overall pattern of sensitivity seems similar to users’ sensitivity to confounding, which is good to see because it means we can flexibly define causal support for inferences about different arrows or sets of arrows in the DAGs that users allocated probabilities to.
Interestingly people seem more sensitive to the impact of disease on treatment than the marginal effect of gene on the rate of disease. This is to be expected in the bars, icons, and text condition where the marginal effect is not directly shown (i.e., users must imagine aggregating across levels of treatment in the contingency table). However, in the aggbars condition, users can collapse across levels of treatment and actually see the marginal effect of gene on disease. Maybe we will see different sensitivity to this counterfactual pattern among users who interacted with aggbars? For filtbars, the effect of gene on treatment requires more clicks to investigate than the effect of gene on disease, so it is surprising to see that this users seem more sensitive to the former. We’ll did deeping into this apparent pattern when we fit linear in log odds models to our data.
Let’s see a histogram of response times per trial. These are measured in seconds.
df %>%
filter(trial_dur >= 0) %>%
ggplot(aes(x = log(trial_dur))) +
geom_histogram() +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As expected, this looks like a power law distribution.
Let’s separate this by condition to see if icons or text is taking systematically longer.
df %>%
filter(trial_dur >= 0) %>%
ggplot(aes(x = log(trial_dur))) +
geom_histogram() +
theme_bw() +
facet_grid(condition ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These distributions look similar.
What about the duration of the experiment per participant? We filter to durations within an hour because Mechanical Turk workers tend to leave browser windows open. This is a very noisy measure.
df %>%
filter(duration >= 0) %>%
filter(trial == 1) %>%
ggplot(aes(x = log(duration))) +
geom_histogram() +
theme_bw() +
facet_grid(condition ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Durations seem mostly similar. Honestly the response time data is not too interesting, but it’s worth checking.
Let’s analyze how users interacted with the aggbars and filtbars visualization conditions, respectively.
We’ll start by writing functions to reconstruct the state of each visualization on each trial based on interaction logs.
reconstruct_state_aggbars <- function(interactions) {
# starting state is conditioning on both gene and treatment
states <- list("gene_treat_init")
for(i in 1:length(interactions)) {
if (interactions[i] == "collapseRow" & str_detect(states[length(states)], "^gene_treat")) {
states <- append(states, list("treat"))
} else if (interactions[i] == "collapseRow" & states[length(states)] == "gene") {
states <- append(states, list("none"))
} else if (interactions[i] == "expandRow" & states[length(states)] == "treat") {
states <- append(states, list("gene_treat"))
} else if (interactions[i] == "expandRow" & states[length(states)] == "none") {
states <- append(states, list("gene"))
} else if (interactions[i] == "collapseCol" & str_detect(states[length(states)], "^gene_treat")) {
states <- append(states, list("gene"))
} else if (interactions[i] == "collapseCol" & states[length(states)] == "treat") {
states <- append(states, list("none"))
} else if (interactions[i] == "expandCol" & states[length(states)] == "gene") {
states <- append(states, list("gene_treat"))
} else if (interactions[i] == "expandCol" & states[length(states)] == "none") {
states <- append(states, list("treat"))
}
}
return(unlist(states))
}
reconstruct_state_filtbars <- function(interactions) {
# starting state is conditioning on nothing
states <- list("none_init")
curr <- "" # state is a chain of filters
for(i in 1:length(interactions)) {
if (interactions[i] == "clearFilters") {
curr <- ""
states <- append(states, list("none"))
} else if (str_detect(interactions[i], "^filter") & str_detect(states[length(states)], "^none")) {
# first filter
curr <- sub("^filter", "", interactions[i])
states <- append(states, list(curr))
} else if (str_detect(interactions[i], "^filter") & !str_detect(curr, paste(".*", sub("^filter", "", interactions[i]), ".*", sep = ""))) {
# only add interactions not already in the chain (don't log duplicate filters which do not change the state)
# put chain of filters including the current one into consistent order (so string matching can identify unique states)
curr <- pmap_chr(list(curr, sub("^filter", "", interactions[i])), ~paste(sort(c(...)), collapse = "_"))
states <- append(states, list(curr))
}
}
return(unlist(states))
}
Now, we’ll reconstruct the states visited on each trial for each visualization separately.
aggbars_df <- df %>%
filter(condition == "aggbars") %>%
rowwise() %>%
mutate(
interactions = str_split(interactions, "_"),
state = list(reconstruct_state_aggbars(interactions))
)
filtbars_df <- df %>%
filter(condition == "filtbars") %>%
rowwise() %>%
mutate(
interactions = str_split(interactions, "_"),
state = list(reconstruct_state_filtbars(interactions))
)
Let’s view histograms of the states visited by users of aggbars and filtbars.
aggbars_df %>%
group_by(trial, workerId) %>%
unnest(cols = c("state")) %>%
ggplot(aes(y = state)) +
geom_bar() +
theme_bw()
We can see that aggbars users create views conditioning on treatment slightly more than they create views conditioning on gene. Perhaps users collapse across levels of gene (i.e., conditioning on treatment only) as a way of investigating whether gene seems ignorable.
filtbars_df %>%
group_by(trial, workerId) %>%
unnest(cols = c("state")) %>%
ggplot(aes(y = state)) +
geom_bar() +
theme_bw()
We can see that users of filtbars tend to condition on gene often and are about as likely to condition on no gene and Gene_Treatment as they are to condition on treatment. Users condition on NoGene_Treatment slightly less often; maybe this signal is not completely necessary in order to see the impact of gene on treatment.
For both aggbars and filtbars, let’s see what proportion of users create views that should be most helpful for the task.
For aggbars this means intentionally creating views that condition on gene, which we see in about 13% of trials.
aggbars_df <- aggbars_df %>%
mutate(
condition_on_gene = any(str_detect(unlist(state), "^gene*") & !str_detect(unlist(state), "^gene_treat_init$"))
)
sum(aggbars_df$condition_on_gene) / length(aggbars_df$condition_on_gene)
## [1] 0.1310729
For aggbars, it seems like more users condition on treatment slightly more than gene. Maybe they are looking at whether grouping by or collapsing across gene makes a difference in the proportion of people with disease. Although this is kind of a roundabout way of checking for signal of a marginal effect, it probably does give an intuitive sense of whether gene seems to impact other factors. What proportion of trials to aggbars users intentionally create views that condition on treat (after seeing an initial disaggregated view conditioning on both gene and treat)? Looks like about 12% of trials.
aggbars_df <- aggbars_df %>%
mutate(
condition_on_treat = any(str_detect(unlist(state), "^treat$"))
)
sum(aggbars_df$condition_on_treat) / length(aggbars_df$condition_on_treat)
## [1] 0.1174089
For filtbars creating the most task-relevant views means intentionally creating views that condition on both gene and no gene, which we see in about 27% of trials.
filtbars_df <- filtbars_df %>%
mutate(
condition_on_gene_nogene = any(str_detect(unlist(state), "^Gene*")) & any(str_detect(unlist(state), "^NoGene*"))
)
sum(filtbars_df$condition_on_gene_nogene) / length(filtbars_df$condition_on_gene_nogene)
## [1] 0.2735894
For filtbars, it seems like more users condition on gene than no gene, and more condition on Gene_Treatment than NoGene_Treatment. Maybe they are comparing the subset of data where people have the gene to the overall data. Although this is not as rigorous as conditioning on both gene and no gene in turn, it shows task relevant signal. What proportion of trials to filtbars users intentionally create views that condition on gene? Looks like about 49% of trials.
filtbars_df <- filtbars_df %>%
mutate(
condition_on_gene = any(str_detect(unlist(state), "^Gene*"))
)
sum(filtbars_df$condition_on_gene) / length(filtbars_df$condition_on_gene)
## [1] 0.496918
Users of filtbars create the most task-relevant views of the data more often than aggbars users. They also interact with the visualization much more frequently overall.
Let’s check out out demographic variables in aggregate just to get a sense of our sample composition.
We did a free response for gender, so we’ll need to do a little lightweight text processing to generate a histogram. A few participants seem to have put their age in the gender box, so we are missing data for these folks. This categorization is not intended to be normative/prescriptive; this is just for the purpose of generating an approximate overview of gender balance in our sample.
df %>%
filter(trial == 1) %>%
rowwise() %>%
mutate(
gender = tolower(as.character(gender)),
gender = case_when(
grepl("woman", gender, fixed = TRUE) | grepl("female", gender, fixed = TRUE) ~ "woman",
(grepl("man", gender, fixed = TRUE) | grepl("male", gender, fixed = TRUE)) &
! (grepl("woman", gender, fixed = TRUE) | grepl("female", gender, fixed = TRUE)) ~ "man",
TRUE ~ "other")
) %>%
ggplot(aes(y = gender)) +
geom_bar() +
theme_bw()
Our sample is pretty gender biased, but this is a better gender balance than in E1.
Let’s also check out age.
df %>%
filter(trial == 1) %>%
ggplot(aes(y = age)) +
geom_bar() +
theme_bw()
Our sample skews toward younger people as in E1.
What about education?
df %>%
filter(trial == 1) %>%
ggplot(aes(y = education)) +
geom_bar() +
theme_bw()
Lots of college educated MTurk workers as in E1. This is probably a good thing for our study since we are studying how people do data analysis, and analysts tend to be college educated.
Last, let’s look at chart use.
df %>%
filter(trial == 1) %>%
ggplot(aes(y = chart_use)) +
geom_bar() +
theme_bw()
Education and chart use are the only demographic variables that we collected which I would expect to have an impact on performance.
Let’s check out the relationships of those variables with avg absolute error.
df %>%
group_by(education) %>%
summarise(
count = length(unique(workerId)),
workerId = list(workerId),
abs_err = list(abs_err)
) %>%
unnest(cols = c("workerId", "abs_err")) %>%
group_by(workerId, education) %>%
summarise(
avg_abs_err = mean(abs_err),
count = unique(count)
) %>%
ungroup() %>%
filter(count > 10) %>%
ggplot(aes(x = avg_abs_err, y = education)) +
stat_eyeh() +
geom_point(position = position_jitter(), alpha = 0.5) +
theme_bw()
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'workerId' (override with `.groups` argument)
df %>%
group_by(workerId, chart_use) %>%
summarise(
avg_abs_err = mean(abs_err)
) %>%
ungroup() %>%
ggplot(aes(x = avg_abs_err, y = chart_use)) +
stat_eyeh() +
geom_point(position = position_jitter(), alpha = 0.5) +
theme_bw()
## `summarise()` regrouping output by 'workerId' (override with `.groups` argument)
These conditional distributions are mostly overlapping. People with a doctorate tend to have lower absolute error, but also this is a relatively small subset of our participants such that a strong statistical inference about this may not be warranted.